ABSTRACT

In text mining, information retrieval, and machine learning, 
text documents are commonly represented through variants 
of sparse Bag of Words (sBoW) vectors (e.g. TF-IDF [1]).
Although simple and intuitive, sBoW style representations 
suffer from their inherent over-sparsity and fail to capture 
word-level synonymy and polysemy. Especially when labeled 
data is limited (e.g. in document classification), or the text
documents are short (e.g. emails or abstracts), many features
are rarely observed within the training corpus. This
leads to overfitting and reduced generalization accuracy. In 
this paper we propose Dense Cohort of Terms (dCoT), an 
unsupervised algorithm to learn improved sBoW document 
features. dCoT explicitly models absent words by removing 
and reconstructing random sub-sets of words in the unla- 
beled corpus. With this approach, dCoT learns to recon- 
struct frequent words from co-occurring infrequent words 
and maps the high dimensional sparse sBoW vectors into 
a low-dimensional dense representation. We show that the 
feature removal can be marginalized out and that the recon- 
struction can be solved for in closed-form. We demonstrate 
empirically, on several benchmark datasets, that dCoT fea- 
tures significantly improve the classification accuracy across 
several document classification tasks. 



*A modified version of our CIKM 2012 paper: From sBoW 
to dCoT, Marginalized Encoders for Text Representation 
1. INTRODUCTION

The feature representation of text documents plays a criti- 
cal role in many applications of data-mining and information 
retrieval. The sparse Bag of Words (sBoW) representation 
is arguably one of the most commonly used and effective 
approaches. Each document is represented by a high di- 
mensional sparse vector, where each dimension corresponds 
to the term frequency of a unique word within a dictionary 
or hash-table [2]. A natural extension is TF-IDF [1], where
the term frequency counts are discounted by the inverse- 
document-frequencies. Despite its wide-spread use with text 
and image data [3] , sBoW does have some severe limitations, 
mainly due to its often excessive sparsity. 
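
As a rough illustration of the two representations just described, the sketch below builds sBoW count vectors and re-weights them by inverse document frequency. The function names, the toy documents, and the plain idf = log(n / df) variant are assumptions made for this example; the paper does not specify which TF-IDF variant it uses.

```python
import math
from collections import Counter

def sbow(doc, vocab):
    """Sparse bag-of-words: one term-frequency count per vocabulary word."""
    counts = Counter(doc.split())
    return [counts[w] for w in vocab]

def tfidf(docs):
    """TF-IDF using idf = log(n / df), an assumed (unsmoothed) variant."""
    vocab = sorted({w for d in docs for w in d.split()})
    tf = [sbow(d, vocab) for d in docs]
    n = len(docs)
    # document frequency: number of documents that contain each term
    df = [sum(1 for row in tf if row[j] > 0) for j in range(len(vocab))]
    idf = [math.log(n / df_j) for df_j in df]
    weighted = [[row[j] * idf[j] for j in range(len(vocab))] for row in tf]
    return vocab, weighted

# hypothetical toy corpus
docs = ["the movie was splendid",
        "the movie was terrific",
        "the stock market fell"]
vocab, weights = tfidf(docs)
```

In this toy corpus a term that occurs in every document (here "the") receives weight zero, while a term unique to one document keeps the largest weight.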

Although the Oxford English Dictionary contains approximately
600,000 unique words, it is fair to say that the
essence of most written text can be expressed with a far
smaller vocabulary (e.g. 5000 to 10000 unique words). For example,
the words splendid, spectacular, terrific, glorious, and
resplendent are all to some degree synonymous with the word
good. However, as sBoW does not capture synonymy, a doc- 
ument that uses "splendid" will be considered dissimilar from 
a document that uses the word "terrific". A classifier, trained 
to predict the sentiment of a document, would have to be 
exposed to a very large set of labeled examples to learn that 
all these words are predictive towards a positive sentiment. 
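
The synonymy problem can be made concrete with a small sketch. The helper below (a hypothetical function written for this illustration) computes the cosine similarity between the sBoW vectors of two short documents; documents that differ only in their choice of synonym share no dimension for that word.

```python
import math
from collections import Counter

def cosine(doc_a, doc_b):
    """Cosine similarity between the bag-of-words count vectors of two texts."""
    a, b = Counter(doc_a.split()), Counter(doc_b.split())
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# one-word documents with synonymous words have no overlap at all
sim_synonyms = cosine("splendid", "terrific")
# the only similarity comes from the shared non-synonym word
sim_shared = cosine("splendid acting", "terrific acting")
```

Here `sim_synonyms` is exactly zero, and `sim_shared` is about one half, all of it contributed by the shared word "acting".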

In this paper, we propose a novel feature learning algorithm
that directly addresses the problem of excessive sparsity
in sBoW representations. Our algorithm, which we refer
to as Dense Cohort of Terms (dCoT), maps high-dimensional
(overly) sparse vectors into a low-dimensional dense
representation. The mapping is trained to reconstruct frequent
words from infrequent words. The training process is entirely
unsupervised, as we generate training instances by randomly
and repeatedly removing common words from text documents.
These removed words are then reconstructed from
the remaining text. We show that the feature removal
process can be marginalized out and that the reconstruction
can be solved for in closed form. The resulting algorithm
is a closed-form transformation of the original sBoW
features, which is extremely fast to train (on the order of
seconds) and apply (milliseconds).

Our empirical results indicate that dCoT is useful for sev- 
eral reasons. First, it provides researchers with an efficient 
and convenient method to learn better feature representa- 
tion for sBoW documents, and can be used in a large va- 
riety of data-mining, learning and retrieval tasks. Second, 
we demonstrate that it clearly outperforms existing document
representations [1][4][5] on several classification tasks.
Finally, it is much faster than most competing algorithms. 

2. RELATED WORK 

Over the years, a great number of models have been de- 
veloped to describe textual corpora, including vector space 
models [i [3 El [10], and topic models g] [ill Ull [S]. Vector
space models reduce each document in the corpus to a
vector of real numbers, each of which reflects the counts of 
an unordered collection of words. Among them, the most 
popular one is the TF-IDF scheme [1], where each dimension
of the feature vector computes the term frequency count
factored by the inverse document frequency count. By down- 
weighting terms that are common in the entire corpus, it ef- 
fectively identifies a subset of terms that are discriminative 
for documents in the corpus. Though simple and efficient, 
TF-IDF reveals little of the correlations between terms, thus 
fails to capture some basic linguistic notions such as synonymy
and polysemy. Latent Semantic Indexing (LSI) [5] attempts
to overcome this. It applies Singular Value Decomposition
(SVD) [14] to the TF-IDF (or sBoW) features to
find a so-called latent semantic space that retains most of 
the variances in the corpus. Each feature in the new space is 
a linear combination of the original TF-IDF features, which 
naturally handles the synonymy problem. 
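
The LSI projection described above can be sketched in a few lines of NumPy. This is only an illustration on a hypothetical five-term toy corpus; the function name `lsi`, the choice of k = 2, and the use of raw counts instead of TF-IDF weights are assumptions of the example.

```python
import numpy as np

def lsi(X, k):
    """Project a term-by-document matrix X (d x n) onto its top-k left singular vectors."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    U_k = U[:, :k]              # (d x k) basis of the latent semantic space
    return U_k, U_k.T @ X       # basis and (k x n) dense document features

# hypothetical toy corpus: rows are terms, columns are documents
X = np.array([
    [1, 1, 1, 0, 0],   # movie
    [1, 0, 1, 0, 0],   # splendid
    [0, 1, 1, 0, 0],   # terrific
    [0, 0, 0, 1, 1],   # stock
    [0, 0, 0, 1, 1],   # market
], dtype=float)

U_k, Z = lsi(X, k=2)
```

Each row of `U_k.T` is a linear combination of the original term dimensions, so the first two documents, which use "splendid" and "terrific" respectively, are mapped to nearby points in the latent space even though their sBoW vectors differ.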

Topic modeling develops generative statistical models to 
discover the hidden "topics" that occur in the corpus. Probabilistic
LSI [11], which was proposed as an alternative to LSI,
models each document as a mixture of a fixed set of topics, 
and each word as a sample generated from a single topic. 
The limitation of probabilistic LSI is that the mixture of 
topics is modeled explicitly for each training data using a 
large set of individual parameters, hence, there is no nat- 
ural way to assign probabilities to unseen documents. La- 
tent Dirichlet Allocation (LDA) [3] solves the problem by 
introducing a Dirichlet prior on the topic distribution, and 
treating the mixing weights as multinomially distributed random
variables. It is probably the most commonly used topic
model nowadays, and the posterior Dirichlet parameters are
often used as the low dimensional representation for various 
tasks [4]. The approach in [15] uses non-linear dimensionality
reduction [16] to embed text data into a low dimensional space,
while preserving pair-wise distances between documents. It is
fair to say that this approach is computationally the most demanding.
Similarly to LSI, pLSI and LDA, our algorithm also maps
the sparse sBoW features into a low dimensional dense
representation. However, it is faster to train and addresses the
problem of synonymy more explicitly.

3. dCoT 

First, we introduce notation that will be used throughout
the paper. Let $D = \{w_1, \dots, w_d\}$ be the dictionary of
words that appear in the text corpus, with size $d = |D|$.
Each input document is represented as a vector $x \in \mathcal{R}^d$,
where each dimension $x_j$ counts the appearances of word $w_j$
in this document. Let $X = [x_1, \dots, x_n]$ denote the corpus.
Assume that the first $n_l \ll n$ documents are accompanied
by corresponding labels $\{y_1, \dots, y_{n_l}\} \in \mathcal{Y}$, drawn from some
joint distribution $\mathcal{D}$.

In this section we introduce the algorithm dCoT, which
translates sparse sBoW vectors $x \in \mathcal{R}^d$ into denser and lower
dimensional prototype vectors. We first define the concept
of prototype terms and then derive the algorithm step-by-step.

Prototype features. Let $P = \{w_{p_1}, \dots, w_{p_r}\} \subset D$,
with $|P| = r$ and $r \ll d$, denote a strict subset of the
vocabulary D, which we refer to as prototype terms. Our algorithm
aims to "translate" each term in D into one or more of
these prototype words with similar meaning. Several choices
are possible to identify P, but a typical heuristic is to pick
the r most frequent terms in D. The most frequent terms
can be thought of as representative expressions for sets of
synonyms, e.g. the frequent word good represents the
rare words splendid, spectacular, terrific, glorious. For this
choice of P, dCoT translates rare words into frequent words.
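
The frequency heuristic for choosing P can be sketched as follows. The function name, the toy documents, and the alphabetical tie-breaking rule are assumptions of this example and are not prescribed by the text.

```python
from collections import Counter

def select_prototypes(docs, r):
    """Return the r most frequent terms of the corpus.

    Ties are broken alphabetically (an arbitrary choice made for this sketch).
    """
    counts = Counter(w for d in docs for w in d.split())
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return ranked[:r]

# hypothetical toy corpus
docs = ["good food good service",
        "splendid food",
        "terrific service good value"]
prototypes = select_prototypes(docs, r=3)
```

With r = 3 the rare words "splendid" and "terrific" fall outside P, so they can only contribute to the final representation through the prototype terms they help reconstruct.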

Corruption. The goal of dCoT is to learn a mapping
$W : \mathcal{R}^d \to \mathcal{R}^r$, which "translates" the original sBoW
vectors in $\mathcal{R}^d$ into a combination of prototype terms in $\mathcal{R}^r$.
Our training of W is based on one crucial insight: if a
prototype term already exists in some input x, W should be
able to predict it from the remaining terms in x. We therefore
artificially create a supervised dataset from unlabeled
data by removing (i.e. setting to zero) each term in x with
some probability $(1 - p)$. We perform this m times and refer
to the resulting corrupted vectors as $\tilde{x}^1, \dots, \tilde{x}^m$. We
remove not only prototype features but all features, to generate
more diverse input samples. (In the subsequent section we
will show that we never actually have to create this
corrupted dataset, as its creation can be marginalized out
entirely, but for now let us pretend it actually exists.)

Reconstruction. In addition to the corruptions, for each
input $x_i$ we create a sub-vector $\bar{x}_i = [x_{p_1}, \dots, x_{p_r}]^\top \in \mathcal{R}^r$
which only contains its prototype features. A mapping
$W \in \mathcal{R}^{r \times d}$ is then learned to reconstruct the prototype features
from the corrupted versions $\tilde{x}_i^j$, by minimizing the squared
reconstruction error,

$$\frac{1}{2nm} \sum_{i=1}^{n} \sum_{j=1}^{m} \left\| \bar{x}_i - W \tilde{x}_i^j \right\|^2. \qquad (1)$$

To simplify notation, we assume that a constant feature is
added to the corrupted input, $\tilde{x}_i = [\tilde{x}_i; 1]$, and an appropriate
bias is incorporated within the mapping, $W = [W, b]$.
Note that the constant feature is never corrupted. The bias
term has the important task of reconstructing the average
occurrence of the prototype features.



Let us define a design matrix

$$\bar{X} = [\bar{x}_1, \dots, \bar{x}_1, \dots, \bar{x}_n, \dots, \bar{x}_n] \in \mathcal{R}^{r \times nm}$$

as the m copies of the prototype features of the inputs. Similarly,
we denote the m corruptions of the original inputs as
$\tilde{X} = [\tilde{x}_1^1, \dots, \tilde{x}_1^m, \dots, \tilde{x}_n^1, \dots, \tilde{x}_n^m]$. With this
notation, the loss in eq. (1) reduces to

$$\frac{1}{2nm} \left\| \bar{X} - W \tilde{X} \right\|_F^2, \qquad (2)$$

where $\| \cdot \|_F^2$ denotes the squared Frobenius norm. The solution
to (2) can be obtained in closed form as the solution
to the well-known ordinary least squares problem,

$$W = R Q^{-1} \quad \text{with} \quad Q = \tilde{X} \tilde{X}^\top \text{ and } R = \bar{X} \tilde{X}^\top. \qquad (3)$$
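
To make the construction concrete, the following is a minimal NumPy sketch that explicitly builds m corrupted copies of a corpus and solves eq. (3). The function name, the synthetic Poisson counts, and the small ridge term added to Q for numerical stability are assumptions of this illustration, not part of the derivation. The copies are stored corpus by corpus instead of document by document, which leaves Q and R unchanged.

```python
import numpy as np

def fit_explicit(X, proto_idx, p, m, rng):
    """Solve eq. (3) on m explicitly corrupted copies of the corpus.

    X: (d, n) term counts; proto_idx: row indices of the prototype terms;
    p: probability that an entry survives corruption.
    Returns W of shape (r, d + 1); the last column is the bias.
    """
    d, n = X.shape
    X_bar = np.tile(X[proto_idx], (1, m))                # (r, n*m) uncorrupted targets
    mask = rng.random((d, n * m)) < p                    # keep each entry with probability p
    X_tilde = np.tile(X, (1, m)) * mask                  # (d, n*m) corrupted inputs
    X_tilde = np.vstack([X_tilde, np.ones((1, n * m))])  # constant feature, never corrupted
    Q = X_tilde @ X_tilde.T
    R = X_bar @ X_tilde.T
    Q += 1e-5 * np.eye(d + 1)                            # ridge term: an added assumption
    # W = R Q^{-1}, computed by solving Q^T W^T = R^T instead of inverting Q
    return np.linalg.solve(Q.T, R.T).T

# hypothetical synthetic corpus: 20 terms, 200 documents
rng = np.random.default_rng(0)
X = rng.poisson(0.3, size=(20, 200)).astype(float)
proto_idx = np.argsort(-X.sum(axis=1))[:5]               # the 5 most frequent terms
W = fit_explicit(X, proto_idx, p=0.5, m=20, rng=rng)
```

The cost of this explicit variant grows linearly with m, which is what motivates taking the limit of infinitely many corruptions analytically.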



Marginalized corruption. Ideally, we would like
the number of corrupted versions to become very large, i.e.
$m \to \infty$. By the weak law of large numbers, R and Q then
converge to their expectations and (3) becomes

$$W = E[R] \, E[Q]^{-1}, \qquad (4)$$

with the expectations of R and Q defined as

$$E[Q] = \sum_{i=1}^{n} E\left[\tilde{x}_i \tilde{x}_i^\top\right], \qquad E[R] = \sum_{i=1}^{n} E\left[\bar{x}_i \tilde{x}_i^\top\right]. \qquad (5)$$


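The uniform corruption allows us to compute the expectations
in (5) in closed form. Let us define a vector
$q = [p, \dots, p, 1]^\top \in \mathcal{R}^{d+1}$, where $q_\alpha$ is the probability that
feature $\alpha$ survives a corruption (the constant feature is never
corrupted, hence $q_{d+1} = 1$). If we denote the scatter matrix of the
uncorrupted input as $S = X X^\top$, we obtain $E[R]_{\alpha\beta} = S_{\alpha\beta} q_\beta$
(with the row index $\alpha$ restricted to the prototype terms, since only
the input $\tilde{x}$ in $R = \bar{X} \tilde{X}^\top$ is corrupted) and

$$E[Q]_{\alpha\beta} = \begin{cases} S_{\alpha\beta} \, q_\alpha q_\beta & \text{if } \alpha \neq \beta \\ S_{\alpha\beta} \, q_\alpha & \text{if } \alpha = \beta \end{cases} \qquad (6)$$

The diagonal entries of E[Q] are the product of two identical
features, and the probability of a feature surviving
corruption is p. The expected value of the diagonal entries
is therefore the corresponding entry of the scatter matrix multiplied
by p. The off-diagonal entries are the product of two different
features $\alpha$ and $\beta$, which are corrupted independently. The
probability of both features surviving the corruption is $p^2$.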
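
The closed-form expectations can be sketched in NumPy as follows. The function names, the synthetic data, and the ridge term are assumptions of this illustration. The sketch scales column $\beta$ of E[R] by $q_\beta$, i.e. by the survival probability of the corrupted input feature, which is what the definition $R = \bar{X} \tilde{X}^\top$ implies.

```python
import numpy as np

def expected_moments(X, proto_idx, p):
    """Closed-form E[Q] and E[R] under independent corruption with survival probability p."""
    d, n = X.shape
    Xb = np.vstack([X, np.ones((1, n))])      # append the constant feature
    q = np.append(np.full(d, p), 1.0)         # the constant feature always survives
    S = Xb @ Xb.T                             # scatter matrix, (d+1, d+1)
    EQ = S * np.outer(q, q)                   # off-diagonal entries: S_ab * q_a * q_b
    np.fill_diagonal(EQ, np.diag(S) * q)      # diagonal entries:     S_aa * q_a
    ER = S[proto_idx] * q                     # prototype rows only:  S_ab * q_b
    return EQ, ER

def fit_marginalized(X, proto_idx, p, ridge=1e-5):
    """W = E[R] E[Q]^{-1}; the ridge term is an added assumption for stability."""
    EQ, ER = expected_moments(X, proto_idx, p)
    EQ = EQ + ridge * np.eye(EQ.shape[0])
    return np.linalg.solve(EQ.T, ER.T).T      # shape (r, d + 1), last column is the bias

# hypothetical synthetic corpus: 6 terms, 50 documents
rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(6, 50)).astype(float)
proto_idx = np.array([0, 2])
W = fit_marginalized(X, proto_idx, p=0.5)
```

No corrupted copy of the data is ever materialized: the cost is one scatter matrix and one linear solve, independent of the number of corruptions.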

Squashing function. The output of the linear mapping
$W : \mathcal{R}^d \to \mathcal{R}^r$ approximates the expected value [17] of
a prototype term. It can be beneficial to have more
bag-of-words-like features that are either present or not. For
this purpose, we apply the tanh() squashing function to the
output,

$$z = \tanh(W x), \qquad (7)$$



which has the effect of amplifying or dampening the feature 
values of the reconstructed prototype words. We refer to 
our feature learning algorithm as dCoT (Dense Cohort of 
Terms). 

3.1 Recursive re-application 

Figure 1: Schematic layout of dCoT. The left
part illustrates that dCoT learns a mapping from
the overly sparse BoW representation to a dense
one. The right part illustrates the recursive
re-application to reconstruct prototype features from
the context.

The linear mapping in eq. (4) is trained by reconstructing
prototype words from partially corrupted input vectors.
This linear approach works well for prototype words that
commonly appear together with words of similar meaning
(e.g. "president" and "obama"), as the mapping captures the
correlation between the two. It can however be the case that 
two synonyms never appear together because the input doc- 
uments are short and the authors use one term or the other 
but rarely both together (e.g. "tasty" and its rarer synonym
"delicious"). In these cases it can help to recursively re-apply
dCoT to its own output. Here, the first mapping reconstructs
a common context between synonyms (i.e. words
that co-occur with all synonyms) and subsequent applications
of dCoT reconstruct the synonym-prototypes from this
context. In the previous example, one could imagine that 
the first application of dCoT constructs context prototype 
words like "food", "expensive", "dinner", "wonderful" from 
the original term "delicious". The re-application of dCoT 
reconstructs "tasty" from these context words. 

Let the mapping from eq. (4) be $W^1 \in \mathcal{R}^{r \times d}$ and
$z_i^1 = \tanh(W^1 x_i)$, for an input $x_i$. We now compute a second
mapping $W^2 \in \mathcal{R}^{r \times r}$, exactly as defined in the previous
section, except that we consider the vectors $z_1^1, \dots, z_n^1 \in \mathcal{R}^r$ as
input. The mapping $W^2$ is an affine transformation which
stays within the prototype space spanned by P. This process
can be repeated many times, and because the input
dimensionality is low the computation of (4) is cheap. Figure 1
illustrates this process in a schematic layout.

If dCoT is applied $l$ times, the final representation $z_i$ is the
concatenated vector of all outputs and the original input,

$$z_i = (x_i, z_i^1, \dots, z_i^l). \qquad (8)$$
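
The recursive re-application and the concatenation in eq. (8) can be sketched as follows. All names, the synthetic data, and the ridge term are assumptions of this illustration. In particular, the sketch assumes that from the second layer onward the uncorrupted layer input serves as its own reconstruction target, since every dimension of that input already lies in the prototype space; the text does not spell this out.

```python
import numpy as np

def fit_layer(inputs, targets, p, ridge=1e-5):
    """Marginalized closed-form mapping from corrupted `inputs` to `targets` (eqs. 4 to 6)."""
    d, n = inputs.shape
    Xb = np.vstack([inputs, np.ones((1, n))])   # constant feature for the bias
    q = np.append(np.full(d, p), 1.0)           # survival probabilities
    S = Xb @ Xb.T
    EQ = S * np.outer(q, q)
    np.fill_diagonal(EQ, np.diag(S) * q)
    ER = (targets @ Xb.T) * q
    # EQ is symmetric, so W = ER EQ^{-1} can be obtained from one linear solve
    return np.linalg.solve(EQ + ridge * np.eye(d + 1), ER.T).T

def apply_layer(W, inputs):
    """z = tanh(W x) with the constant feature appended to x (eq. 7)."""
    Xb = np.vstack([inputs, np.ones((1, inputs.shape[1]))])
    return np.tanh(W @ Xb)

def dcot(X, proto_idx, p, layers):
    """Return the concatenation (x, z^1, ..., z^l) of eq. (8), one column per document."""
    Z = apply_layer(fit_layer(X, X[proto_idx], p), X)
    outputs = [X, Z]
    for _ in range(layers - 1):
        # assumption: each later layer reconstructs its own uncorrupted input
        Z = apply_layer(fit_layer(Z, Z, p), Z)
        outputs.append(Z)
    return np.vstack(outputs)

# hypothetical synthetic corpus: 30 terms, 100 documents, 5 prototype terms
rng = np.random.default_rng(0)
X = rng.poisson(0.3, size=(30, 100)).astype(float)
proto_idx = np.argsort(-X.sum(axis=1))[:5]
features = dcot(X, proto_idx, p=0.5, layers=3)
```

Because every layer after the first operates on r-dimensional inputs, the additional layers only require solving small (r + 1) by (r + 1) systems.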



4. CONNECTION 

dCoT shares some common elements with previously pro- 
posed feature learning algorithms. In this section, we discuss 
their similarities and differences. 

Stacked Denoising Autoencoder (SDA). In the field
of image recognition, the Autoencoder [18] and the Stacked
Denoising Autoencoder (SDA) [19] are widely used to learn
better feature representations from raw pixel input. dCoT
shares several core similarities with SDA, which in fact
inspired its original development. Similar to dCoT, SDA first
corrupts the raw input and learns to reconstruct it. SDA
also stacks several layers together by feeding the output of
previous layers as input into subsequent layers. However,
the two algorithms also have substantial differences. The
mapping in dCoT is a linear mapping from input to output
(with a subsequent application of tanh()), which is solved
in closed form. In contrast, SDA employs a non-linear mapping
from the input to a hidden layer and then to the output.
Instead of a closed-form solution, it requires extensive
gradient-descent-type hill-climbing. Further, SDA actually
corrupts the input and is trained with multiple epochs over
the dataset, whereas dCoT marginalizes out the corruption.
In terms of running time, dCoT is orders of magnitude
faster than SDA and scales to much higher dimensional
inputs [20] [21].

[Figure 2 table. Column groups: constant, prototype terms, non-prototype terms. Input terms (one per column): zero vector, reagan, nasdaq, bush, union, colorado, reproduction, budapest, rescues.]

Figure 2: Term reconstruction from the Reuters dataset. Each column shows a different input term (e.g.
"reagan", "nasdaq"), along with the prototype terms reconstructed from this particular input in decreasing
order of feature values (top to bottom). The leftmost column shows the prototype terms generated by an
all-empty input document.

Principal Component Analysis (PCA). Similar to
dCoT, Principal Component Analysis (PCA) [25] learns a
lower dimensional linear space by minimizing the reconstruc- 
tion error of the original input. For text documents, PCA 
is widely known through its variant as latent semantic in- 
dexing (LSI) [5]. Although both dCoT and LSI minimize 
reconstruction errors, the exact optimization is quite dif- 
ferent. dCoT explicitly reconstructs prototype words from 
corruption, whereas LSI minimizes the reconstruction error 
after dimensionality reduction. 

5. RESULTS 

We evaluate our algorithm on Reuters and Dmoz datasets 
together with several other algorithms for feature learning. 

Datasets. The Reuters-21578 dataset is a collection of 
documents that appeared on Reuters newswire in 1987. We 
follow the convention of [23] , which removes documents with 
multiple category labels. The dataset contains 65 categories, 
and consists of 5946 training and 2347 testing documents. 
Each document is represented by sBoW representation with 
18933 distinct terms. The Dmoz dataset is a hierarchical 
collection of webpage links. The top level of the hierarchy 
consists of 16 categories. Following the convention of [24],
we labeled each input by its top-level category, and removed
some low-frequency terms. As a result, the dataset contains
7184 and 1796 training and testing points respectively, and 
each input is represented by the sBoW representation that 
contains 16498 distinct terms. 

Figure 3: Classification accuracy trend on the Reuters
dataset with different layers and noise levels.

Reconstruction. Figure 2 shows example input terms
(essentially one-word documents) and the prototype words
that are reconstructed with dCoT on the Reuters dataset.
Each column represents a different input term (e.g. a
document consisting of only the term "nasdaq") and shows the
reconstructed prototype terms in decreasing order of their 
feature values (top to bottom). The very left column shows 
the prototype features generated by an all-empty input doc- 
ument. These features are completely determined by the 
constant bias, and coincide with the most frequent prototype
terms in the whole corpus. For all other columns, we
subtract this bias-generated vector to highlight the prototype
words generated by the actual word and not the bias.
As shown in the figure, two trends can be observed. First, 
prototype terms are reconstructed by other less common and 
more specific terms. For example, president is reconstructed 
by reagan and bush, and stock is reconstructed by nasdaq. 
Both reagan and bush are specific terms describing pres- 
ident. This trend indicates that dCoT learns the mapping 
from rare terms to common terms. Second, context and top- 
ics are reconstructed from rarer terms through the recursive 
re-application. For example, agriculture is reconstructed by 
reproduction, indicating that documents containing repro- 
duction typically discuss topics related to agriculture. This 
connection indicates that dCoT also learns the higher order 
correlations between terms and topics. 

Figure 4: Semi-supervised learning results on the Reuters (left) and Dmoz (right) datasets. On both datasets,
dCoT outperforms all other algorithms, especially when the number of labeled inputs is relatively small.



Parameter sensitivity. We also evaluate the effect of
different noise levels and numbers of layers (i.e. the number of
recursive re-applications). Figure 3 shows the classification
results on the Reuters dataset as a function of the number of
layers $l$ and the noise level $1 - p$. After training dCoT (on the whole dataset),
we randomly select 1,000 labeled training inputs, train an 
SVM classifier flT] on the new feature representation, and 
test on the full testing set. Two trends emerge. First, deep
layers ($l > 1$) improve over a single-layer transformation,
supporting our hypothesis that as we recursively re-apply
dCoT, not only is the feature representation enriched, but
the higher order correlations between terms and topics
are also learned. Second, the best results are obtained with a
surprisingly high level of noise. We explain this trend by the fact
that more corruption helps discover more subtle relationships
between features; because we operate in the limit and integrate
out all possible corruptions, we can still learn even from
substantially shortened documents.

Semi-supervised Experiments. In many real-world 
applications, the labeled training inputs are limited, because 
labeling usually involves human interaction and is expensive 
and time-consuming. However, unlabeled data is usually
plentiful and readily available. In this experiment we evaluate
how well dCoT can take advantage of semi-supervised learning
settings. We learn the new feature representation with
dCoT on the full training set (without labels), but train a 
linear SVM classifier on a small subset of labeled examples. 
We gradually increase the size of the labeled subset and 
evaluate on the whole testing set. For any given number of 
labeled training inputs, we average over five runs (of ran- 
domly picked labeled examples). We use the validation set 
to select the best combination of noise level and the number 
of layers. 

As baselines, we compare against several alternative fea- 
ture representations, which are all obtained from the full 
training set, similarly applied to a linear SVM classifier. 
The most basic baselines are the sBoW representation (with 
term frequency counts) and TF-IDF [1]. We compute the TF
for each document separately, and obtain the IDF from the 
whole training set (including labeled and unlabeled data). 
We then apply the same IDF to the testing set. We also compare
against latent semantic indexing (LSI) [5], for which
we further split the training set into training and validation
sets. We use the validation set to find the best parameter
(number of leading eigenvectors), and retrain on the whole
training set with the best parameter. The new representation
is obtained by projecting the sBoW feature space onto
the LSI eigenvectors. Finally, we also compare against Latent
Dirichlet Allocation (LDA) [3]. Similar to LSI, we use a
validation set to find the best parameters, which include the
Dirichlet hyper-parameter and the number of topics. The
new representation learned from LDA consists of the topic mixture
probabilities.

Table 1: Running time required for unsupervised
feature learning with different algorithms.

The classification results are presented in Figure 4. The
graph shows that on both Dmoz and Reuters datasets, dCoT 
generally out-performs all other algorithms. This trend is 
particularly prominent in settings with relatively little la- 
beled training data. 

Running time. Table 1 compares the running times for
feature learning with different algorithms. All timings are
performed on a desktop with dual Intel Six Core Xeon
X5650 2.66GHz processors. Compared to LDA and LSI, the
timing results show a speed-up of three orders of magnitude on
the two datasets, reducing the feature learning time from several
hours to a few minutes.

6. CONCLUSION 

In this paper we present dCoT, an algorithm that ef- 
ficiently learns a better feature representation for sBoW 
document data. Specifically, dCoT learns a mapping from 
high dimensional sparse to low dimensional dense represen- 
tations by translating rare to common terms. Recursive re- 
application of dCoT on its own output results in the discov- 
ery of higher order topics from raw terms. On two standard 
benchmark document classification datasets we demonstrate 
that our algorithm achieves state-of-the-art results with very 
high reliability in semi-supervised settings. 